1 FastQ Files

Get the input files

Copy the archive containing the data.

cd 
cp /DATA/bioinfo/data.tar ~
tar xvf data.tar

This archive contains, among other things, the FastQ files for 3 individuals:

  • father
  • mother
  • child

Paired-end sequencing implies that each individual has 2 FastQ files:

  • xxxx.R1.fastq.gz for the Forward Reads
  • xxxx.R2.fastq.gz for the matching Reverse Reads

The .gz extension indicates that the file has been compressed with gzip.

The files are:

child.R1.fastq.gz
child.R2.fastq.gz
father.R1.fastq.gz
father.R2.fastq.gz
mother.R1.fastq.gz
mother.R2.fastq.gz

1.1. File Format

FastQ Files

FastQ files are plain-text (human-readable) files that contain the reads produced by the sequencer. Each read is described by 4 lines:

  1. starts with a @ symbol, followed by a sequence identifier.
  2. the actual biological sequence.
  3. a + symbol
  4. the quality scores corresponding to the bases in Line 2.

1.1.1. Count the lines in the R1 file for the child. Check that it is a multiple of 4.

zcat DATA/child.R1.fastq.gz | wc -l
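
The check can be sketched on a toy file. The two reads below are made-up data, and gzip -dc is used as an equivalent of zcat; the temporary file name is only an illustration.

```shell
# Toy FastQ file with two reads (made-up data, for illustration only).
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nFFFF\n' | gzip > "$tmp/toy.R1.fastq.gz"

# gzip -dc behaves like zcat on .gz files; tr removes any padding added by wc.
lines=$(gzip -dc "$tmp/toy.R1.fastq.gz" | wc -l | tr -d ' ')
remainder=$((lines % 4))   # 0 when the file only holds complete 4-line records
reads=$((lines / 4))       # one read per 4-line record
echo "lines=$lines remainder=$remainder reads=$reads"
rm -r "$tmp"
```

On the toy file this prints lines=8 remainder=0 reads=2. A non-zero remainder would suggest a truncated or corrupted file.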

1.1.2. Count the lines in the R2 file for the child. Is it the same as for R1 or different? Why?

zcat DATA/child.R2.fastq.gz | wc -l
Paste

FastQ files are a succession of 4-line groups:

id1
seq1
+
qual1
id2
seq2
+
qual2
...
idn
seqn
+
qualn

By using paste - - - - we can put each group on a single line with 4 tab-separated columns:

id1     seq1    +    qual1
id2     seq2    +    qual2
...
idn     seqn    +    qualn

The command cut -f can then extract the selected column (for example, cut -f 2 keeps only the sequences).
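
As a small illustration, here is the same idea applied to two made-up reads held in a shell variable (the variable names are arbitrary):

```shell
# Two made-up reads, 4 lines each.
fastq=$(printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nFFFF\n')

# paste - - - - reads 4 lines at a time and joins them with tabs.
table=$(printf '%s\n' "$fastq" | paste - - - -)

# cut -f 2 keeps the second tab-separated column: the sequences.
seqs=$(printf '%s\n' "$table" | cut -f 2)

printf '%s\n' "$table"
printf '%s\n' "$seqs"
```

The first printf shows one read per line; the second shows only ACGT and TTGA.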

1.1.3. Compare the IDs of the first 10 reads. What are the differences between the R1 and R2 IDs?

zcat DATA/child.R1.fastq.gz | paste - - - - | cut -f 1 | head
zcat DATA/child.R2.fastq.gz | paste - - - - | cut -f 1 | head
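
To view the two ID lists side by side, one option is to write them to files and paste them together. The sketch below uses made-up reads and assumes the mates are marked with a trailing /1 and /2; the real files may mark the mate differently, so the sub() pattern is an assumption to adapt.

```shell
tmp=$(mktemp -d)

# Made-up R1 and R2 reads; the /1 and /2 suffixes are an assumed naming convention.
printf '@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\nFFFF\n' | paste - - - - | cut -f 1 > "$tmp/r1.ids"
printf '@read1/2\nCCGT\n+\nIIII\n@read2/2\nGGGA\n+\nFFFF\n' | paste - - - - | cut -f 1 > "$tmp/r2.ids"

# Side-by-side view: R1 ID, tab, R2 ID.
paste "$tmp/r1.ids" "$tmp/r2.ids"

# Remove the assumed mate suffix and count the pairs whose names still differ.
mismatches=$(paste "$tmp/r1.ids" "$tmp/r2.ids" \
  | awk -F '\t' '{ sub(/\/[12]$/, "", $1); sub(/\/[12]$/, "", $2); if ($1 != $2) n++ } END { print n + 0 }')
echo "mismatched pairs: $mismatches"
rm -r "$tmp"
```

A count of 0 means that, once the mate marker is removed, every R1 read has its R2 partner at the same position in the file.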

1.1.4. What are the lengths of the short reads?

zcat DATA/child.R1.fastq.gz | paste - - - - | cut -f 2 | head
zcat DATA/child.R1.fastq.gz | paste - - - - | cut -f 2 | head -1 | tr -d '\n'| wc -c
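
The second command measures only the first read. To summarise the lengths of many reads at once, awk's length() function can be used, as in this sketch on made-up reads:

```shell
# Three made-up reads: two of length 4 and one of length 6.
lengths=$(printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nTTGA\n+\nFFFF\n' \
  | paste - - - - \
  | cut -f 2 \
  | awk '{ n[length($0)]++ } END { for (l in n) print l, n[l] }' \
  | sort -n)

# Each output line is "<read length> <number of reads>".
printf '%s\n' "$lengths"
```

On the toy data this prints "4 2" then "6 1". On real data, replace the printf with zcat on the FastQ file.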

1.1.5. Extract all the bases from the R1 reads of the child. Count the occurrences of each nucleotide.

zcat DATA/child.R1.fastq.gz | paste - - - - | cut -f 2 | head -100000 | grep -o . | sort | uniq -c
grep

Here grep -o . prints each character on its own line.
head -100000 limits the count to the first 100,000 reads for faster processing.
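
The same pipeline can be checked by hand on two made-up reads. The final awk step is an addition that only swaps the columns, so that the output does not depend on the padding uniq -c adds.

```shell
# Made-up reads ACGT and TTGA: 2 A, 1 C, 2 G, 3 T in total.
counts=$(printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n' \
  | paste - - - - \
  | cut -f 2 \
  | grep -o . \
  | sort \
  | uniq -c \
  | awk '{ print $2, $1 }')

# Each output line is "<nucleotide> <count>".
printf '%s\n' "$counts"
```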

1.2. Quality Score

1.2.1. Which characters are present in the quality score lines?

zcat DATA/child.R1.fastq.gz | paste - - - - | cut -f 4 | head -100000 | grep -o . | sort | uniq | paste - - - - - -
Phred33 Score
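  • ASCII codes from 33 (!) to 126 (~): 94 characters, encoding Phred scores 0 to 93
    (Figure: ASCII characters and their scores, asciiscore.png)
  • Score = ASCII code - 33 = Phred quality Q
  • Probability of incorrect identification: P = 10^(-Q/10)
    Q = 10 ➡ 1 in 10, Q = 20 ➡ 1 in 100, Q = 30 ➡ 1 in 1000, ...

Example: J ➡ ASCII code 74 ➡ Q = 74 - 33 = 41 ➡ P = 10^(-4.1), about 8 errors per 100,000 bases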
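
The conversion can be sketched in the shell, assuming the Phred+33 encoding: the score is Q = ASCII code - 33 and the error probability is P = 10^(-Q/10). The variable names are arbitrary.

```shell
char='J'

# printf with a leading quote prints the numeric code of the character.
code=$(printf '%d' "'$char")

# Phred+33 encoding (assumed): subtract the offset of 33.
phred=$((code - 33))

# Error probability P = 10^(-Q/10), computed with awk.
prob=$(awk -v q="$phred" 'BEGIN { printf "%.2e", 10 ^ (-q / 10) }')

echo "char=$char ascii=$code phred=$phred p_error=$prob"
```

For J this prints char=J ascii=74 phred=41 p_error=7.94e-05.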

1.3. Quality Control

Quality Control of the fastq files produced by the sequencer

  • Quality preservation along the Reads
  • Overall Quality of each Read
  • GC% offset and base distribution might indicate contamination
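
To make "overall quality of each read" concrete, the sketch below computes a mean Phred score per read with awk on made-up data. It assumes Phred+33 encoding and is only an illustration of the idea, not a description of how FastQC computes its reports.

```shell
# Three made-up reads: I encodes Q40 and 5 encodes Q20 (Phred+33 assumed).
means=$(printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n5555\n@read3\nGGCA\n+\nI5I5\n' \
  | paste - - - - \
  | awk -F '\t' '
      # Lookup table: character -> ASCII code, for the printable range 33..126.
      BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
      {
        sum = 0; n = length($4)
        # Column 4 is the quality string; convert each character to a score.
        for (j = 1; j <= n; j++) sum += ord[substr($4, j, 1)] - 33
        printf "%s\t%.1f\n", $1, sum / n
      }')

# Each output line is "<read ID> <tab> <mean Phred score>".
printf '%s\n' "$means"
```

The toy reads give mean scores of 40.0, 20.0 and 30.0.
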
Check fastq file with fastqc
fastqc

File > Open: DATA/child.R1.fastq.gz

1.3.1. Good Per Base Quality

1.3.2. Bad Per Base Quality

1.3.3. Good Per Sequence Quality

1.3.4. Bad Per Sequence Quality

1.3.5. MultiQC

multiqc

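Usually, we don't want to open and inspect each FastQ file one by one. Instead, we run fastqc on all the files and let multiqc merge the results into a single report.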

mkdir report
fastqc DATA/*.fastq.gz --outdir report
multiqc report/*.zip
Examine QC

Download the multiqc_report.html file:
On Windows

  • Refresh the file browser in MobaXterm
  • Copy the multiqc_report.html file to Windows
  • Open it

On Mac

  • Open another terminal
  • Type scp -P 4444 XXXXX@methionine:multiqc_report.html . (where XXXXX is your login)
  • Open the file

Next Step: 2 Alignments